In phylogenetic studies, the evolution of molecular sequences is assumed to have taken place along the phylogeny 
traced by the ancestors of extant species. In the presence of lateral gene transfer (LGT), however, this may not be the 
case, because the species lineage from which a gene was transferred may have gone extinct or not have been sampled. 
Because it is not feasible to specify or reconstruct the complete phylogeny of all species, we must describe the evolution 
of genes outside the represented phylogeny by modelling the speciation dynamics that gave rise to the complete phy- 
logeny. We demonstrate that if the number of sampled species is small compared to the total number of existing species, 
the overwhelming majority of gene transfers involve speciation to, and evolution along extinct or unsampled lineages. 
We show that the evolution of genes along extinct or unsampled lineages can to good approximation be treated as those 
of independently evolving lineages described by a few global parameters. Using this result, we derive an algorithm 
to calculate the probability of a gene tree and recover the maximum likelihood reconciliation given the phylogeny of 
the sampled species. Examining 473 near universal gene families from 36 cyanobacteria, we find that nearly a third of 
transfer events - 28% - appear to have topological signatures of evolution along extinct species, but only approximately 
6% of transfers trace their ancestry to before the common ancestor of the sampled cyanobacteria. 
''From the first growth of the tree, many a limb 
and branch has decayed and dropped off; and 
these lost branches of various sizes may represent 
those whole orders, families, and genera which 
have now no living representatives, and which 
are known to us only from having been found in a 
fossil state.'' 

Charles Darwin, On the Origin of Species. London, 1859 

Most of the diversity of life that ever existed on earth has gone 
extinct and can only be glimpsed from the fossil record. Al- 
though the comparative approach allows the reconstruction 
of some morphological and genetical characteristics of an- 
cestral species, it is only informative for species that have 
founded extant lineages. Yet, the information enclosed in 
genome sequences is abundant and particularly meaningful 
for the reconstruction of the descent and evolution of their carriers 
(Zuckerkandl and Pauling, 1965; Boussau and Daubin, 2010; David and 
Alm, 2011), so much so that it may have 
recorded accounts of extinct lineages. This possibility exists 
because the success of lateral gene transfer (LGT) as an evo- 
lutionary process implies that each gene possesses its own, 
unique history, which is not necessarily confined to the history 
of those species that have survived (Maddison, 1997; Galtier 
and Daubin, 2008; Fournier et al., 2009; Abby et al., 2012). 

Several models have recently been developed to reconcile 
seemingly contradictory gene phylogenies with the species 
phylogeny by tracing the path on the species phylogeny along 
which they evolved as a result of a series of speciations, gene 
duplications, LGT and losses (Tofigh, 2009; Doyon et al., 
2010; David and Alm, 2011; Szollosi and Daubin, 2012; 
Szollosi et al., 2012). None of these models, however, take 
into consideration the fact that, in the presence of LGT, gene 
trees record evolutionary paths along the complete species 
tree, including extinct and unsampled branches, and not only 
along the phylogeny of the species in which they reside today. 
This is the case because, as first noted by Maddison (Maddi- 
son, 1997) and later elaborated by Gogarten et al. (Zhaxy- 
bayeva and Gogarten, 2004; Fournier et al., 2009) , while 
LGT events imply that the donor and receiver lineages existed 
at the same time, the donor lineage might have subsequently 
become extinct, or more generally, might not have been sam- 
pled. 

Here we demonstrate that, if the number of species consid- 
ered in the species phylogeny is small compared to the total 
number of species, the overwhelming majority of gene trans- 
fers involve speciation to, and evolution along extinct or un- 
sampled species. Furthermore, we show that, if this condition 
is met, the evolution of genes along the unrepresented parts of 
the species phylogeny can to good approximation be treated 
as those of independently evolving lineages, the behaviour of 
which depends only on the global parameters of the specia- 
tion dynamics. This in turn allows us to derive the probability 
of observing a gene phylogeny by extending the ODT model 
introduced previously (Szollosi et al., 2012). Applying our 
model to a dataset derived from 36 cyanobacterial species, we 
perform a preliminary assessment of the phylogenetic signal 
for the evolution of transferred genes along extinct species. 

A MINIMAL MODEL OF SPECIATION AND GENE 
BIRTH AND DEATH 

It is not feasible to specify, much less to reconstruct, the 
complete phylogeny of all species that ever existed. To describe 
the evolution of genes outside the represented phylogeny - along 
lineages that have become extinct or whose descendants have not 
been sampled - we must resort to modelling the speciation dynamics 
that gave rise to the complete phylogeny. Modelling the dynamics of 
speciation provides a stochastic model of the evolution of 
unrepresented lineages that can be used to describe gene histories 
given knowledge of the represented phylogeny and a few global 
parameters. 

As a minimal model of speciation, here, we assume that the number 
of species $N$ is constant, and that the dynamics of speciation is 
modeled by a continuous time Moran process (Moran, 1962). That is, 
for each species at rate $\sigma$, a speciation occurs during which 
the species gives rise to two descendants and a randomly chosen 
species goes extinct (cf. Fig. 1a). The central assumption we make 
is that, of the species existing at present (i.e., $t = 0$), we 
sample only a small fraction $n \ll N$. In general, the validity of 
this assumption depends on the phylogenetic problem considered, but 
should almost always be met for major groups of bacteria and 
archaea, where the number of species that potentially exchange 
genes by LGT is inevitably much larger than the number of sampled 
species, even in large scale studies (Ochman et al., 2000; Torsvik 
et al., 2002). 

To describe the evolution of genes within the genomes of species 
we assume genes to evolve independently according to a 
birth-and-death process that consists of gene duplication, transfer 
and loss (Tofigh, 2009; Szollosi and Daubin, 2012; Szollosi et al., 
2012). As shown in Fig. 1b, a gene in the genome of any of the $N$ 
species can: i) be duplicated at rate $\delta$; ii) be transferred 
from a donor species to any of the other $N - 1$ possible host 
species at a rate $\tau/(N - 1)$; or iii) be lost at a rate 
$\lambda$. Gene copies can also be born and be lost as a result of 
the speciation dynamics: iv) at the species level lineages 
experience speciation at a rate $\sigma$, in which case they are 
replaced by two copies in the two new species, or v) suffer 
extinction at an identical rate $\sigma$. A branch $e$ of the 
represented tree $S$ in general corresponds to a series of 
speciation events; however, as shown in Fig. 1c, only the last one 
of these, the speciation event that gave rise to two represented 
lineages, is explicitly present for internal branches as the (green 
online) speciation node terminating the branch. 

FIG. 1. Gene trees are the result of the combination of speciation 
and gene birth and death. As a minimal description we consider: a) 
that for each of the $N$ species at a rate $\sigma$, a speciation 
occurs, during which the species is succeeded by two descendants, 
and a random species suffers extinction; b) at a rate $\delta$ per 
gene, a gene duplicates, i.e., it is succeeded by two gene copies in 
the same genome; at a rate $\tau/(N - 1)$ per gene per host species, 
a gene is transferred, resulting in one copy each in the donor and 
host species; and finally with a rate $\lambda$ per gene, a gene is 
lost. The represented phylogeny c) corresponds to the tree spanned 
by the $n$ sampled species. A branch of the represented tree 
corresponds to a series of speciation events, but only the last of 
these, the speciation event that gives rise to two represented 
lineages (filled circles, green online), is explicitly present for 
internal branches as the speciation node terminating the branch. The 
number of unrepresented species (dashed circles) is always much 
larger than the number of represented species (full circles). 

ALMOST ALL TRANSFERS INVOLVE SPECIATION 

To understand what fraction of transfers involves evolution 
along unrepresented species we must compare the relative rate 
of transfers that are direct transfers between branches of the 
represented phylogeny S, and indirect transfers that result in 
a gene returning to S after exiting it via speciation or transfer 
to unrepresented species. 

To compare the contribution of indirect transfers and direct 
transfers to observed gene histories, we consider first only direct 
transfers and indirect transfers that involve a speciation to an 
unrepresented species. To describe the shape of the species tree 
generated by the Moran process introduced above, we can use the 
coalescent approach. Here, under Kingman's coalescent, the time to 
the most recent common ancestor of the $n$ sampled species is of the 
order of $(2N/\sigma)(1 - 1/n) \approx 2N/\sigma$ (Kingman, 1982). 
This implies that the expected number of unrepresented speciation 
events per branch of the species tree is much larger than one, being 
of the order of $\sigma \times (2N/\sigma)/(2n - 2) \approx N/n \gg 1$, 
as there are $(2n - 2)$ branches of $S$. This suggests that for any 
pair of coexisting branches of the represented tree, a gene that 
descends from one of the branches and is transferred to the other 
is likely to have experienced a speciation event "away" from 
the represented phylogeny spanned by the $n$ sampled species 
before being transferred back to it. 
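
The order-of-magnitude argument above can be checked numerically. The following is a minimal sketch rather than the authors' code: the function names and the example values of N, n and sigma are assumptions chosen for illustration, and the expression for the time to the most recent common ancestor is simply the one quoted in the text.

```python
def expected_tmrca(N, n, sigma):
    """Order-of-magnitude time to the MRCA of n sampled species,
    using the expression quoted in the text: (2N/sigma) * (1 - 1/n)."""
    return (2.0 * N / sigma) * (1.0 - 1.0 / n)


def unrepresented_speciations_per_branch(N, n, sigma):
    """Expected number of speciation events along one branch of S.

    Follows the heuristic in the text: speciation rate sigma times the
    tree height, divided by the 2n - 2 branches of the represented tree."""
    return sigma * expected_tmrca(N, n, sigma) / (2 * n - 2)


if __name__ == "__main__":
    N, n = 10**6, 36      # hypothetical total and sampled species counts
    sigma = 2.0 * N       # hypothetical rate; gives a tree of roughly unit height
    print(unrepresented_speciations_per_branch(N, n, sigma), N / n)
```

Under these assumptions the two printed numbers agree, which is the N/n scaling used in the argument.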

FIG. 2. The overwhelming majority of transfers involve evolution 
along unrepresented species. A direct transfer (dark grey, blue 
online) between two terminal branches of the represented phylogeny 
occurs with rate $\tau/(N - 1)$ and involves a single transfer event. 
An indirect transfer (light grey, red online) leaves an 
indistinguishable record in the gene tree topology. To count indirect 
transfers, we trace their history backwards in time: transfers back 
to the host branch on the represented tree (branch $e$) occur with a 
rate $\tau/(N - 1)$ from each of the $N - n$ unrepresented species; 
of these we are only concerned with the ones which descend from the 
relevant donor branch (branch $f$), the number of which can be 
calculated using the exponential coalescence probability and the rate 
of unrepresented speciations $\sigma$ from the donor branch 
(branch $f$). 

To quantify the above argument we can compare the expected number 
of transfers from branch $f$ to branch $e$ of the represented 
phylogeny, resulting from either a direct transfer or a more complex 
history involving a speciation event. Clearly, if the branches do 
not overlap in time, the expected number of direct transfers is 
zero. To consider overlapping branches let us assume for simplicity 
that both $e$ and $f$ are terminal branches - similar results can be 
derived for any other pair of overlapping branches. The expected 
branch lengths are then $E(t_e) = E(t_f) \approx N/(\sigma n)$, with 
overlap $\min(t_e, t_f) < N/(\sigma n)$. Integrating over possible 
transfer times, the expected number of direct transfers is then 

To estimate the expected number of indirect transfers that 
are topologically indistinguishable from the above direct 
transfers we can reason backwards in time as illustrated in 
Fig. 2: i) the rate at which a transfer occurs from each of the 
$(N - n)$ unrepresented species to branch $e$ is $\tau/(N - 1)$; ii) 
the probability of this gene lineage not coalescing back to any 
of the $n$ branches of the represented tree during a time interval 
$t$ is $\exp(-n\sigma t/N)$; and iii) the rate at which it coalesces 
with branch $f$ is $\sigma/N$. Integrating over possible speciation 
and transfer times gives: 

Equations (1) and (2) show that if the number of sampled 
species is small compared to the total number of species 
($n \ll N$), then the expected number of direct transfers is small 
compared to indistinguishable indirect ones 
($T_{\mathrm{direct}} \ll T_{\mathrm{indirect}}$), i.e., the 
contribution of direct transfers to observed gene histories is 
negligible. 
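
The comparison between direct and indirect transfers can be explored numerically from the rates stated above. This is a hedged sketch, not the authors' derivation: it assumes that both terminal branches span the same interval of length T = N/(sigma n), that the donor lineage must coalesce with branch f before f ends, and all function names and parameter values are hypothetical.

```python
import math


def expected_direct(tau, sigma, N, n):
    """Direct transfers f -> e: rate tau/(N-1) integrated over the
    overlap of two terminal branches, taken here as N/(sigma*n)."""
    overlap = N / (sigma * n)
    return tau / (N - 1) * overlap


def expected_indirect(tau, sigma, N, n, steps=2000):
    """Indirect transfers f -> unrepresented species -> e.

    Midpoint-rule integral over the transfer time t1 on branch e.
    Assumption: both branches span the same interval of length T, so the
    donor lineage has time T - t1 left to coalesce with branch f."""
    T = N / (sigma * n)
    rate_back = (N - n) * tau / (N - 1)   # transfers into e from outside S
    coal = n * sigma / N                  # rate of coalescing with any branch of S
    h = T / steps
    total = 0.0
    for i in range(steps):
        t1 = (i + 0.5) * h
        # integral of (sigma/N) * exp(-coal * t) for t in [0, T - t1]
        p_from_f = (sigma / N) * (1.0 - math.exp(-coal * (T - t1))) / coal
        total += rate_back * p_from_f * h
    return total


if __name__ == "__main__":
    tau, N, n = 0.2, 10**6, 36            # hypothetical values
    sigma = 2.0 * N
    print(expected_indirect(tau, sigma, N, n) / expected_direct(tau, sigma, N, n))
```

Under these assumptions the ratio of indirect to direct transfers grows in proportion to N/n, in line with the conclusion that direct transfers are negligible when few species are sampled.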

To compare the two types of possible indirect transfers back 
to $S$ - those exiting via speciation and those via transfer - we 
must contrast the rate $\sigma$ at which gene copies exit branch $f$ 
as a result of speciation and the rate 
$\tau/(N - 1) \times (N - 1 - n) \approx \tau$ at which gene copies 
exit as a result of transfer. Estimates of $\tau$, and more 
generally gene birth and death rates, are available from several 
sources, all of which agree that the expected number of gene birth 
and death events per branch is below unity. Models that consider 
the dynamics of the number of homologous gene copies along a species 
phylogeny (referred to as phylogenetic profiles) (Csuros and Miklos, 
2009) have consistently found that birth and death rates are of the 
same order, with an excess of loss compensated by origination of new 
families, in agreement with phenomenological models of gene family 
size distribution (Karev et al., 2002; Szollosi and Daubin, 2012). 
In a detailed study, Csuros et al. found for 28 archaea that the 
expected number of birth events (duplication and gain) is 0.12 and 
that the expected number of losses is 0.36 per branch per gene 
(Csuros and Miklos, 2009). More recently, the ODT model, which 
attempts to explicitly explain the evolution of multi-copy gene 
trees (representative of complete genomes) along an ultrametric 
species tree, has arrived at similar results (Szollosi et al., 
2012), finding for 36 cyanobacterial genomes 
$\delta \approx \tau \approx 0.2$ and $\lambda \approx 1$, in units 
corresponding to a tree with unit height. Assuming, as above, that 
the time to the most recent common ancestor of the sampled species 
is of the order $2N/\sigma$, the expected number of gene copies (per 
gene) exiting a branch of $S$ as a result of speciation is 
proportional to $N/n$, while the number exiting as a result of 
transfer is less than one. Since the rate at which a gene that has 
exited the represented phylogeny returns to $S$ as a result of 
transfer at some point in the future is independent of the mode of 
exit from $S$, we can conclude that indirect transfers are dominated 
by paths that include a speciation. 

In summary, if the number of sampled species is small com- 
pared to the total number of species, transfers in observed 
gene histories are dominated by paths that include a specia- 
tion to an unrepresented species and subsequent transfer back 
to the represented tree. 



THE PROBABILITY OF OBSERVING A GENE TREE 

Reconciling gene trees with the species tree requires iter- 
ating over possible paths along which a gene tree may have 
been generated by a series of speciations, duplications, trans- 
fers and losses (Fig. 3). In existing methods (Tofigh, 2009; 
Doyon et al., 2010; Szollosi and Daubin, 2012; Szollosi et al., 
2012) , this is accomplished by only considering paths along 
the represented phylogeny and using a dynamic programming 
approach exploiting the independence of gene birth and death 
events, and by extension gene lineages. 

While gene duplication, transfer and loss can reasonably 
be modeled as independent birth and death events, speciation 
and extinction necessarily involve the simultaneous birth and 
death of many genes. Along the represented phylogeny, speciation 
events are fully specified and can be explicitly taken 
into account (Szollosi et al., 2012). This is not the case, 
however, for speciation and extinction events that occur in 
the unrepresented part of the phylogeny, or do not correspond 
to speciation nodes of the represented phylogeny. Therefore, 
unrepresented speciations result in non-independence of gene 
lineages. 

FIG. 3. Reconciling gene trees with the complete phylogeny. a) shows an evolutionary scenario that involves a transfer event from an 
unrepresented species. The represented phylogeny is shown as a solid tube with filled circles (green online) corresponding to represented 
speciations. The unrepresented phylogeny is indicated by dashed tubes, with white circles corresponding to unrepresented speciations (cf. 
Fig. 1c). The continuous line traces the gene tree spanned by genes in sampled species that is the result of a series of birth and death events 
along the complete phylogeny. b) shows a reconciliation of the gene phylogeny from (a), corresponding to the evolutionary scenario depicted in (a). 
In general we do not know the evolutionary scenario that has generated the gene phylogeny. However, we can use the dynamic programming 
algorithm described in the text to calculate the likelihood of the gene tree by summing over all possible reconciliations, i.e., all ways to draw 
the gene tree into the species tree using speciation, duplication, transfer and loss events (cf. Eqs. 4-7 and Fig. A1 in the Appendix). The likelihood 
calculation uses the rates of the different events ($\sigma$, $\tau$ and $\lambda$) together with functions describing the extinction ($E_e$ and $E$) and the propagation 
($G_e$ and $G$) of gene lineages (cf. Eqs. A2-A5). 

Consider for instance the probability $E_k(t)$ that $k$ genes 
present at time $t$ in a species not ancestral to the sample 
of $n$ extant species leave no observed descendant. Conditional 
on the complete phylogeny $\phi$, including all extinct 
species lineages, gene lineages are independent, and therefore 
$E_k(t|\phi) = \{E(t|\phi)\}^k$. Averaging over all complete 
phylogenies compatible with the phylogeny reconstructed based on 
the $n$ species, however, results in 
$\langle E_k(t|\phi)\rangle_\phi = \langle\{E(t|\phi)\}^k\rangle_\phi \neq \langle E(t|\phi)\rangle_\phi^k$, 
which is not a product of $k$ independent factors. 

On the other hand, $n \ll N$ implies that $E(t) \approx 1$. 
Introducing the notation $E(t|\phi) = 1 - \epsilon(t|\phi)$ and 
$E(t) = 1 - \epsilon(t)$, and neglecting second and higher order 
terms in $\epsilon(t|\phi)$ and $\epsilon(t)$ we have: 

$$
\begin{aligned}
E_k(t) &= \langle E_k(t|\phi)\rangle_\phi = \langle\{E(t|\phi)\}^k\rangle_\phi = \langle\{1 - \epsilon(t|\phi)\}^k\rangle_\phi \\
&\approx \langle 1 - k\,\epsilon(t|\phi)\rangle_\phi = 1 - k\,\langle\epsilon(t|\phi)\rangle_\phi = 1 - k\,\epsilon(t) \\
&\approx \{1 - \epsilon(t)\}^k = \{E(t)\}^k.
\end{aligned}
\qquad (3)
$$

A similar argument can be derived for the $k$-gene propagator 
$G_k(s,t)$ (see Appendix). Therefore, if $n \ll N$, then to 
good approximation, the evolution of two genes observed in 
the same unrepresented species can be treated as independent 
without specifying the full phylogeny. 
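
The quality of this first-order approximation can be illustrated with a small numerical comparison. This is a sketch under stated assumptions: the per-phylogeny values of epsilon below are made up for illustration and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def averaged_joint_extinction(eps_by_phylogeny, k):
    """<{E(t|phi)}^k>: average over phylogenies of the joint probability
    that k genes all leave no observed descendant."""
    return mean([(1.0 - eps) ** k for eps in eps_by_phylogeny])


def independent_approximation(eps_by_phylogeny, k):
    """{E(t)}^k with E(t) = 1 - <eps(t|phi)>: genes treated as independent."""
    return (1.0 - mean(eps_by_phylogeny)) ** k


if __name__ == "__main__":
    eps = [1e-4, 5e-4, 2e-3]   # hypothetical small values, as expected when n << N
    k = 5
    exact = averaged_joint_extinction(eps, k)
    approx = independent_approximation(eps, k)
    print(exact, approx, exact - approx)
```

The difference between the two quantities is second order in epsilon, so it shrinks rapidly as the sampled fraction n/N decreases.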

Under the above assumption that unrepresented speciation 
and extinction events can be considered in a gene-wise 
independent manner, we can describe the evolution of gene copies 
that appear as single gene lineages when observed from the 
present. We can calculate: i) the extinction probability $E_e(t)$ 
that a gene seen at time $t$ on branch $e$ of $S$ leaves no 
observed descendant, i.e., no descendant exists at time $t = 0$ 
in the genome of any of the $n$ sampled species; ii) the 
extinction probability $E(t)$ that a gene seen at time $t$ in an 
unrepresented species leaves no observed descendant; iii) the single 
gene propagation probabilities $G_e(s,t)$ that all observed 
descendants of a gene seen at time $s$ on branch $e$ descend from 
a descendant seen at a later time $t < s$ on branch $e$; and iv) 
$G(s,t)$, the probability that all observed descendants of a gene 
seen at time $s$ in an unrepresented species descend from a 
descendant seen at time $t < s$ in an unrepresented species. Each 
of the above functions can be expressed as differential equations 
describing evolution backwards in time by considering 
the set of possible events that change the relevant probability. 
These can be derived analogously to (Tofigh, 2009; Stadler, 
2011; Szollosi et al., 2012) and can be found in the Appendix. 

Given a rooted gene tree topology $G$ we can now calculate 
the probability $p(G|S, M)$ of observing $G$, where $M$ denotes 
the parameters of the model, by summing over all possible 
paths along $S$ and over all complete phylogenies compatible 
with the species tree spanning the $n$ species of the sample. We 
can sum over all paths by recursively mapping the branches 
of $G$ onto branches of $S$, generalising the ODT model's 
algorithm (Szollosi et al., 2012) to include evolution along 
unrepresented species (cf. Fig. 3 and Fig. A1 in the Appendix). 

A branch of $G$ represents the evolution of a gene copy for 
which i) if the branch is nonterminal, all observed descendants 
descend from one of the two daughter gene lineages which 
emerge from the gene tree node in which the branch terminates, 
or ii) if the branch is terminal, a gene is observed in one 
of the genomes mapping to a leaf of $S$. To describe possible 
paths along $S$ that this gene copy may take before arriving at 
the gene tree node in which it terminates, we must consider 
five events: i) single-copy evolution along branch $e$ of $S$, 
described by $G_e(s,t)$; ii) single-copy evolution outside $S$, 
described by $G(s,t)$; iii) speciation from a branch of $S$ to an 
unrepresented species such that only descendants of this copy are 
observed; iv) transfer such that only descendants of the 
transferred copy are observed; and v) speciation represented in $S$ 
such that only one of the descending copies leaves an observed 
descendant. Each of these events leads to a single gene copy with 
observed descendants. The gene tree node in which the branch 
terminates can correspond to four possible events: i) a duplication; 
ii) a speciation represented in $S$; iii) a speciation not 
represented in $S$; or iv) a transfer. Each of these events leads to 
two gene copies with observed descendants. 

To derive the recursion expressing the probability of 
$G$ as the sum over possible paths along $S$ we discretize 
time along $S$, keeping track of speciation times $t_i$ along 
$S$. Speciations represented in $S$ define the time intervals 
$[0, t_1), \ldots, [t_i, t_{i+1}), \ldots, [t_{n-1}, t_n)$, referred 
to as time slices (Tofigh, 2009; Doyon et al., 2010), with indices 
$0, \ldots, i, \ldots, n$. We further divide each time slice into 
$D$ equal time intervals of height $\Delta t_i = (t_{i+1} - t_i)/D$. 
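
The discretization can be sketched as follows. This is an illustrative helper, not part of the published method: the function name, the tuple layout and the example speciation times are assumptions.

```python
def time_slices(speciation_times, D):
    """Return (start, end, dt) for each time slice between consecutive
    speciation times, with dt = (end - start) / D as in the text.

    speciation_times: times of represented speciations, measured
    backwards from the present (t = 0)."""
    bounds = [0.0] + sorted(speciation_times)
    slices = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        slices.append((start, end, (end - start) / D))
    return slices


if __name__ == "__main__":
    # hypothetical tree with speciations at relative times 0.5 and 1.0
    for start, end, dt in time_slices([0.5, 1.0], D=5):
        print(start, end, dt)
```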

The probability of the gene lineage leading to node $u$ of $G$ 
being seen on branch $e$ of $S$ at time $t + \Delta t_i$, given the 
probabilities at time $t$ in time slice $i$, is 

$$
\begin{aligned}
P_e(u, t + \Delta t_i) = {} & G_e(t + \Delta t_i, t)\,P_e(u,t) \\
& + \{\delta\Delta t_i\}\,P_e(v,t)\,P_e(w,t) \\
& + \{\sigma\Delta t_i\}\,P(v,t)\,P_e(w,t) \\
& + \{\sigma\Delta t_i\}\,P_e(v,t)\,P(w,t) \\
& + \{\sigma\Delta t_i\}\,P(u,t)\,E_e(t),
\end{aligned}
\qquad (4)
$$

where $P(u,t)$ denotes the probability of the gene lineage leading 
to node $u$ of $G$ being seen in an unrepresented species at 
time $t$, and $v$ and $w$ descend from $u$ in $G$. As shown in 
Fig. A1a in the Appendix, the terms correspond to i) no event with 
an observed descendant; ii) birth of two gene lineages by 
duplication, such that both leave observed descendants; iii) and iv) 
birth of two gene lineages with observed descendants as a result 
of an unrepresented speciation; and finally, v) unrepresented 
speciation followed by the loss of the copy in branch $e$ such 
that only the copy in the unrepresented phylogeny leaves an 
observed descendant. In the above expression we only consider 
indirect transfers that involve a speciation; see the Appendix 
for the full expression. 
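
A single update of this recursion can be written out directly. The sketch below transcribes the simplified form of Eq. 4 (indirect transfers via speciation only); the function name, the dictionary-based data layout and the numerical values are assumptions made for illustration, not the authors' implementation.

```python
def step_represented(G_e, P_e, P_out, E_e, u, v, w, delta, sigma, dt):
    """One backward step of Eq. 4 for a single branch e.

    G_e:   propagator G_e(t + dt, t) on branch e
    P_e:   dict node -> probability of being seen on branch e at time t
    P_out: dict node -> probability of being seen in an unrepresented species
    E_e:   extinction probability E_e(t) on branch e
    v, w:  children of u in the gene tree, or None when u is a leaf"""
    value = G_e * P_e[u]                              # i) no observed event
    if v is not None and w is not None:
        value += delta * dt * P_e[v] * P_e[w]         # ii) duplication
        value += sigma * dt * P_out[v] * P_e[w]       # iii) unrepresented speciation
        value += sigma * dt * P_e[v] * P_out[w]       # iv) unrepresented speciation
    value += sigma * dt * P_out[u] * E_e              # v) speciation, copy in e lost
    return value


if __name__ == "__main__":
    P_e = {"u": 0.1, "v": 0.2, "w": 0.3}              # hypothetical values
    P_out = {"u": 0.01, "v": 0.02, "w": 0.03}
    print(step_represented(0.9, P_e, P_out, 0.5, "u", "v", "w",
                           delta=0.2, sigma=2.0, dt=0.01))
```

Terms involving the children are skipped when u is a leaf, matching the convention stated for the recursion.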

The probability of being seen in such an unrepresented 
species is: 

$$
P(u, t + \Delta t_i) = G(t + \Delta t_i, t)\,P(u,t) + \ldots + \sum_{e \in \mathcal{E}_i(S)} \ldots\, E(t)\,P_e(u,t), \qquad (5)
$$

where $\mathcal{E}_i(S)$ denotes the set of branches of $S$ in time 
slice $i$. As shown in Fig. A1b, the terms correspond to i) no event 
with an observed descendant; ii) birth of two gene lineages by 
speciation, duplication or transfer, such that both leave 
observed descendants; iii) and iv) birth of two gene lineages with 
observed descendants as a result of transfer back to the 
represented phylogeny; and finally, v) transfer back to the 
represented phylogeny following which the copy in the 
unrepresented donor lineage does not leave an observed descendant. 
Terms involving gene lineages $v$ and $w$ are zero if $u$ is a leaf 
of $G$ in both the above expressions. 

At speciation times $t = t_i$ where branches $f$ and $g$ descend 
from $e$ in $S$, a represented speciation takes place that may be 
followed by a loss: 

$$
P_e(u,t) = P_f(v,t)\,P_g(w,t) + P_f(w,t)\,P_g(v,t) + P_f(u,t)\,E_g(t) + E_f(t)\,P_g(u,t). \qquad (6)
$$

The terms (cf. Fig. A1c) correspond to i) and ii) represented 
speciation such that both resulting gene lineages lead to 
observed descendants; and iii) and iv) represented speciation 
such that only one of them does. Finally, at time $t = 0$ on 
each terminal branch $e$ of $S$ the presence of observed genes 
is expressed as: 



$$
P_e(u, 0) =
\begin{cases}
1 & \text{if } u \text{ is a leaf of } G \text{ found in } e, \\
0 & \text{otherwise.}
\end{cases}
\qquad (7)
$$
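
The boundary condition and the represented-speciation step can be sketched in a few lines. This is a minimal illustration of Eqs. 6 and 7 only; the function names, argument layout and example numbers are hypothetical.

```python
def initial_probability(u_is_leaf, u_species, e):
    """Eq. 7: 1 if u is a leaf of G found in terminal branch e, else 0."""
    return 1.0 if (u_is_leaf and u_species == e) else 0.0


def represented_speciation(P_f, P_g, E_f, E_g, u, v, w):
    """Eq. 6: branch e splits into daughter branches f and g.

    P_f, P_g: dicts node -> probability on branches f and g at time t
    E_f, E_g: extinction probabilities on f and g at time t
    v, w:     children of u in the gene tree, or None when u is a leaf"""
    value = P_f[u] * E_g + E_f * P_g[u]               # one daughter copy is lost
    if v is not None and w is not None:               # u is an internal node of G
        value += P_f[v] * P_g[w] + P_f[w] * P_g[v]    # both copies are observed
    return value


if __name__ == "__main__":
    P_f = {"u": 0.1, "v": 0.2, "w": 0.3}              # hypothetical values
    P_g = {"u": 0.4, "v": 0.5, "w": 0.6}
    print(represented_speciation(P_f, P_g, 0.7, 0.8, "u", "v", "w"))
    print(initial_probability(True, "speciesA", "speciesA"))
```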



As illustrated in Figs. 3b and A1, each term in equations 4-7 
above corresponds to a series of speciation, duplication and 
transfer events that recursively draw the gene phylogeny into 
the species tree. The recursion calculates the probability of a 
gene tree with $m$ genes in $O(Dn^2m)$ steps, as there are fewer 
than $n$ branches in each time slice and $n$ time slices. Summing 
over roots of G can be accomplished with identical complex- 
ity using double recursion. The most likely reconciliation can 
be recovered by tracing back along the sum choosing at each 
step the event with the highest probability. 

Calculating the probability of a gene tree requires knowledge 
of the ultrametric species tree with branch lengths 
corresponding to time, the rates of duplication $\delta$, transfer 
$\tau$ and loss $\lambda$, as well as the parameters of the 
speciation dynamics, the species replacement rate $\sigma$ and the 
total number of species $N$. The number of parameters is reduced if 
we assume the time to the common ancestor of the sampled species 
to correspond to its expected value under speciation dynamics. 
Choosing units such that $S$ is of unit height this corresponds 
to the choice $\sigma = 2N$. Furthermore, under the present choice 
of parameters and time scale, the probability of a gene tree 
and its maximum likelihood reconciliation depends only very 
weakly on $N$, as long as the condition $n \ll N$ is satisfied. 
This is the case because the expected number of transfers 
between branches of $S$ is nearly independent of $N$. In particular, 
if we assume that a gene lineage returns at most once to $S$ we 
arrive at the result derived in equation 2 according to which 
the number of transfers is independent of $N$. 

FIG. 4. Lateral gene transfer events for 36 cyanobacteria. For 473 
near universal single-copy families from 36 cyanobacterial genomes 
gene trees that maximize the joint likelihood were reconstructed. For 
the trees obtained 1000 reconciliations were sampled. a) shows the 
distribution of transfer events (light bars, green online) and the pre- 
ceding speciation events (dark bars, blue online). The final bin sum- 
marizes all events occurring above the root of S. b) shows the distri- 
bution of the time spent by transferred genes evolving along unrepre- 
sented species for transfers between overlapping branches (dark bars, 
red online, 72.2% of transfers) and transfers between nonoverlapping 
branches (light bars, yellow online, 27.8% of all transfers). Both sets 
of bins sum to unity. Time units are chosen such that the height of 
the root of S is 1.0. The age of the root falls in the 3500 - 2700 Mya 
interval (Falcon et al., 2010; Szollosi et al., 2012) . Data is available 
from Dryad under doi:10.5061/dryad.27d0g. 



ROUTES TO CYANOBACTERIAL GENOMES 

To carry out a preliminary analysis of the signal for evolu- 
tion outside the represented phylogeny in real data, we con- 
sidered a set of 473 single-copy gene families present in the 
genome of at least 34 of 36 cyanobacteria and use the dated 
species tree reconstructed in (Szollosi et al., 2012) . We 
choose single-copy near universal gene families as they are 
expected to be i) relatively slowly evolving and hence to har- 
bor a strong signal of homology and yield high quality align- 
ments, and ii) they can be assumed to be well described by 
a single set of uniform duplication, transfer and loss rates, at 
least in contrast to more complex datasets composed of multi- 
copy families. For each family, gene tree topologies and du- 
plication, transfer and loss rates that maximize the joint like- 
lihood (Maddison, 1997; Szollosi and Daubin, 2012) were 
inferred as described in the Appendix. Using these results 
1000 reconciliations per family were sampled by stochastic 
backtracking along the sum over reconciliations. 

On average we found duplications, 2.15 transfers and 
2.56 losses per family. The distribution in time of transfer 
events and the preceding speciations to unrepresented species 



are shown in Fig.4a. The majority of transfers occur between 
branches of S that overlap in time, hence the resulting gene 
tree carries no topological signature of the length of time spent 
evolving along unrepresented lineages. Transfers between 
branches that do not overlap in time, for which the gene tree 
topologies explicitly record evolution outside the represented 
tree, correspond to 27.8% of all transfers. About a fifth of these 
(5.9% of all transfers) branch above 
the root indicating transfer from outside the sampled diversity 
of cyanobacteria. The median interval of time spent evolving 
in unrepresented lineages is 0.083 (or 222 million years, 
henceforth myr) for transfers between overlapping branches 
and 0.39 (or 1000 myr) for transfers between nonoverlapping 
branches. Similar values are obtained if we consider only the 
maximum likelihood reconciliations, except for the median 
interval of time spent evolving in unrepresented lineages for 
transfers between overlapping branches, which is only 0.0028 
(or 8.1 myr, corresponding to the minimum length allowed by 
time discretization). The corresponding value for transfers 
between nonoverlapping branches, 0.36 (or 990 myr), is nearly 
identical to the value above. 

We emphasize that an important caveat of these results is 
that the accuracy of our method to infer correct reconcilia- 
tions and gene topologies has not been assessed. This could 
be accomplished by explicit simulations of gene family evo- 
lution along the complete phylogeny. Such simulations are, 
however, outside the scope of the current publication, as they 
are technically challenging due to the large number of species 
in the complete phylogeny, and since they must address a po- 
tentially long list of possible questions. In lieu of simulation 
it is possible to examine the posterior support of individual 
transfer events, which, as described in the appendix, can be 
calculated as the fraction of times we find a given transfer 
event among the sampled reconciliations for each family. Us- 
ing this measure, we find that transfers are well supported with 
66.8% of transfer events having support over 0.95. 
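
The support measure described above amounts to counting how often each transfer appears among the sampled reconciliations. A minimal sketch follows, assuming that each sampled reconciliation has already been reduced to a set of transfer identifiers; the function name and the (donor, recipient) encoding are hypothetical.

```python
from collections import Counter


def transfer_support(sampled_reconciliations):
    """Fraction of sampled reconciliations that contain each transfer.

    sampled_reconciliations: list of sets; each set holds hashable
    transfer identifiers, here hypothetical (donor, recipient) pairs."""
    counts = Counter()
    for transfers in sampled_reconciliations:
        counts.update(set(transfers))
    total = len(sampled_reconciliations)
    return {event: c / total for event, c in counts.items()}


if __name__ == "__main__":
    samples = [
        {("f", "e")},
        {("f", "e"), ("g", "e")},
        {("f", "e")},
        set(),
    ]
    print(transfer_support(samples))
```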

It is also important to discuss to what extent we can ex- 
pect observed transfers between nonoverlapping branches to 
be robust to increasing the number of sampled species. Con- 
sider the extreme case that all extant species are sampled. 
It is clear that transfers between overlapping branches of S 
(red in Fig. 4b) may correspond to transfer between 
nonoverlapping branches of the full phylogeny spanned by all $N$ 
extant species. To ascertain how often we expect the opposite 
to occur, i.e., to have a transfer between nonoverlapping branches 
of $S$ correspond to transfers between overlapping branches of 
the full phylogeny spanned by all $N$ extant species, we need 
to estimate how often we expect to sample an extant descendant 
of the unrepresented donor lineage involved in a transfer 
between nonoverlapping branches of $S$ (light bars, yellow online, 
in Fig. 4b). Assuming a tree with unit height, the total 
branch length of the full phylogeny under Kingman's coalescent 
is of the order of $\log(N)$, while the total branch length 
including extinct species is of the order $N$. Thus, we expect 
that only a vanishing fraction of the order $\log(N)/N$ of 
donor lineages have left extant descendants. This implies that 
not only do most transfers involve speciation to, and evolution 
along branches of the complete phylogeny, but the majority of 
these donor lineages have gone extinct. Consequently, most 
transfers between nonoverlapping branches of S correspond 
to transfers between nonoverlapping branches of the full phy- 
logeny where the donor lineage has gone extinct. 
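
To give a sense of scale for this argument, the following minimal sketch evaluates the order-of-magnitude ratio log(N)/N for a few values of N. The values of N are illustrative assumptions rather than estimates from the analysis, the function name is hypothetical, and constant prefactors are ignored, as in the text.

```python
import math


def extant_donor_fraction(n_species: int) -> float:
    """Order-of-magnitude fraction log(N)/N of donor lineages that are
    expected to have left extant descendants (prefactors ignored)."""
    if n_species < 2:
        raise ValueError("need at least two species")
    return math.log(n_species) / n_species


# Illustrative (assumed) values of N; not taken from the analysis.
for n_species in (10**3, 10**5, 10**7):
    fraction = extant_donor_fraction(n_species)
    print(f"N = {n_species:>8}: log(N)/N ~ {fraction:.2e}")
```

The ratio shrinks rapidly as N grows, which is the sense in which the fraction is described as vanishing.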

In summary, we find that nearly a third (27.8%) of transfers
evolve on average a billion years along lineages unrepresented
in the phylogeny, most often, in fact, along extinct
lineages, and only a moderate fraction of transfers originate 
from outside the cyanobacteria. Furthermore, both of these 
estimates are conservative, as increasing the number of sam- 
pled species is expected to lead to an increase in the ratio of 
transfers between nonoverlapping branches, and to a decrease 
in the fraction of transfers from outside of cyanobacteria. The 
first of the above results, however, applies only to transfers 
between branches of S, i.e., transfers observed for the n = 36 
cyanobacteria considered. For the complete set of transfers 
between branches of the full phylogeny the fraction of transfers
evolving along extinct lineages is potentially different, e.g.
a macroscopic fraction of transfers are expected to correspond
to direct transfers between its branches.



DISCUSSION 

The results developed above are conditional on two crucial 
assumptions: i) that the number of sampled species is small 
compared to the total number of species, and ii) the evolution 
of gene lineages can be treated as independent, both in the 
represented and the unrepresented part of the phylogeny. As 
we argue above, if genes are duplicated, transferred and lost 
independently, the former assumption (i.e., n ≪ N) implies
that the evolution of genes outside the represented phylogeny 
can also be treated as independent, even if the complete phy- 
logeny is not specified. 

We also make the assumptions that iii) transfer occurs with 
identical rate between any two species and iv) that the time to 
the last common ancestor of the sampled species corresponds 
to its expected value under the speciation dynamics. These 
conditions serve to simplify the development of the above ar- 
guments and can be relaxed without affecting our conclusion 
that the majority of transfers involve evolution along extinct 
or unsampled species. Relaxing condition iv is straightforward.
Concerning assumption iii, if, for example, transfer
occurs preferentially between species that are more closely
related (Andam and Gogarten, 2011), the scenarios shown
in figure 2 are affected to an identical extent because the
last common ancestor of branch e and either branch f (the
donor lineage for dark grey paths, blue online) or any extinct
species that descends from an unrepresented speciation along
f (a donor lineage along light grey paths, red online) is the
same. Conversely, there are known cases, e.g. the transfer
of thermostable enzymes from thermophilic archaea to
thermophilic bacteria (Nelson et al., 1999; Nesbo et al., 2001;
Brochier-Armanet and Forterre, 2007), of preferential transfer
between distantly related taxa due to shared ecology. In
this second case, we expect genes preferentially transferred
from phylogenetically distant taxa to lead to an excess of
transfers descending from above the root of the sampled
species, for which topologically equivalent direct transfers
do not exist. On a more practical level, however, relaxing
the assumption of homogeneous rates of transfer between
lineages might seriously complicate the computation of the
likelihood, as it would require modelling the distribution of
the rates of transfer from and to unrepresented lineages.

More importantly, as long as these conditions are met, it is 
possible to extend the above results to more general models 
of speciation. Modelling variation in N, the total number of 
species, over geological times, could be of particular interest. 
Indeed, a corollary of the observation that LGT events record 
evolutionary paths along the complete species tree is that the 
phylogenies of genes from a limited sample of extant species 
carry information about extinct lineages, and therefore about 
the size and dynamics of ancient biodiversity. In fact, patterns 
of gene transfer may be even more informative about past bio- 
diversity than the species tree itself. Drawing an analogy with 
population genetics, inferring biodiversity dynamics based on 
species trees (Nee, 2001; Morlon et al., 2010; Stadler, 2011)
is similar to inferring past demography based on single-locus
data. Single-locus inference is limited by the intrinsic stochas- 
ticity of Kingman's coalescent, in particular in the deep part 
of the genealogy. Lateral gene transfers, on the other hand, 
are analogous to multiple loci (Heled and Drummond, 2008),
and as such, have the potential to increase the statistical power 
for inferring past biodiversity. 

APPENDIX

The dynamic programming algorithm described in equations
4-7 calculates the likelihood of a gene tree given the
species tree S and the rates of speciation, duplication, transfer
and loss. As illustrated in Fig. 3, the likelihood is calculated in
a piecewise independent manner, from the evolution of gene
copies that appear as single gene lineages when observed from
the present. Since, in contrast to Fig. 3, we do not know the
exact evolutionary scenario, we must sum over all
reconciliations. This process can be represented as summing over all
possible ways to draw the gene tree into the species tree using
the set of events shown in Fig. A1. The diagrams in Fig. A1,
and the corresponding terms in equations 4-7, are expressed
using two types of functions describing the evolution of single
gene lineages: i) the extinction probabilities $E_e$ and $E$ that
give the probability of a gene present on, respectively, branch e
of S or an unrepresented species having no descendant at time
t = 0 in the genome of any of the n sampled species; ii) the
single gene propagators $G_e(s, t)$ and $G(s, t)$ corresponding to
the probability that all sampled descendants of the gene seen
at time s, respectively, on branch e of S, or in an
unrepresented species, descend from the gene present at a later time t
in the same species.

Below we provide the expressions for each of these func- 
tions that can be derived using the theory of birth-and-death 
processes. We also discuss the independence assumption in 
relation to the single gene propagators, write down the com- 
plete form of equation 4 and describe the details of the data 
analysis presented in the main text. 



Evolution of single genes 

The forward Kolmogorov equations describing single gene 
extinction and propagation can be derived analogously to 
(Tofigh, 2009; Stadler, 2011; Szollosi et al., 2012). The
main difference is that here we also consider the speciation
dynamics. 

The extinction probability for branch e of S:

$$\frac{d}{dt}E_e = \{\lambda\}(1-E_e) - \{\delta(1-E_e)\}E_e - [\ldots] \qquad \text{(A2)}$$

$$E_e(t_e^{\mathrm{end}}) = \begin{cases} 0 & \text{if } t_e^{\mathrm{end}} = 0, \\ E_f(t_f^{\mathrm{begin}})\,E_g(t_g^{\mathrm{begin}}) & \text{otherwise,} \end{cases}$$

where $S_i(S)$ denotes the set of branches of S in time slice i,
and $n_i$ their number. The terms correspond to i) loss, ii) the
rate of duplications and iii) transfers to represented hosts, both
conditional on survival, and finally iv) the rate of
unrepresented speciations and transfers to unrepresented hosts, again
conditional on survival. The initial conditions specify that at
the end of branch e the probability of extinction is 0 if we are
at time t = 0, i.e., e is a terminal branch of S, and the
product of the extinction probabilities of the descendants of e in S
otherwise.



The extinction probability in an unrepresented species:

$$\frac{d}{dt}E = \{\sigma+\lambda\}(1-E) - \Big\{\big(\sigma+\delta+[\ldots]\,\tau\big)(1-E)\Big\}E - \Big\{\sum_{f\in S_i(S)}[\ldots]\Big\}E \qquad \text{(A3)}$$

$$E(0) = 1.$$

Note that the term corresponding to transfer back to S acts as
an inhomogeneity in the absence of which the only solution is
$E(t) = 1$.

The single observed lineage propagator along branch e of S:

$$\frac{d}{dt}G_e = -\{\lambda+\delta(1-2E_e)\}G_e - \Big\{\sum_{f\in S_i(S)}[\ldots]\Big\}G_e - \{\sigma(1-E)\}G_e \qquad \text{(A4)}$$

$$G_e(t,t) = 1.$$

The single observed lineage propagator in an unrepresented
species:

$$\frac{d}{dt}G = -\{\sigma+\lambda\}G - \Big\{\big(\sigma+\delta+[\ldots]\,\tau\big)(1-2E)\Big\}G - \Big\{\sum_{f\in S_i(S)}[\ldots]\Big\}G \qquad \text{(A5)}$$

$$G(t,t) = 1.$$

Note that if we set $E(t) = 1$ and neglect gene birth and
death, which is much slower than the speciation dynamics,
i.e., $\delta + \tau + \lambda \ll \sigma$, we recover the exponential probability of
coalescence with the represented tree assumed in equation 2.

FIG. A1. Diagrams corresponding to reconciliation events. Each diagram corresponds to a term in equations 4-7, with diagrams following
each other in the same order as terms in the indicated equation. a) depicts events that start with a gene lineage u in represented branch e of S
at time $t + \Delta t$; b) events that start with a gene lineage u in an unrepresented species at time $t + \Delta t$; finally, c) corresponds to represented
speciation events in S. To illustrate the correspondence between terms and equations, consider the third diagram in the top row (a), depicting an
unrepresented speciation, and the corresponding (third) term in equation 4. This term, $P_e(u, t + \Delta t) = \cdots + \{\sigma \Delta t\} P_e(v, t) P(w, t) + \cdots$,
describes the probability that gene lineage u seen at time $t + \Delta t$ is succeeded as a result of an unrepresented speciation by two gene lineages (v
and w), one of which (w) is present in the same branch e as it while the other (v) resides in an unrepresented species.

The propagator describing the evolution of k gene copies
can be expressed using the single gene copy propagator.
Consider the expression for $G_k(s, t)$, the probability that k genes
seen at time s in an unrepresented species all leave a single
descendant descending from the copy seen at time t < s in an
unrepresented species:

$$\frac{d}{dt}G_k = -\{\sigma + k\lambda\}G_k - \Big[\sigma(1-2E_k) + k\big(\delta + [\ldots]\,\tau\big)(1-2E)\Big]G_k \qquad \text{(A6)}$$

$$G_k(t,t) = 1.$$

Since equation 3 implies $-\sigma G_k - \sigma(1-2E_k)G_k = -2\sigma(1-E^k)G_k$,
and neglecting second and higher order terms in $1 - E$ gives
$2(1-E^k) \simeq 2k(1-E)$, the above can be written as

$$\frac{d}{dt}G_k \simeq -k\{\sigma+\lambda\}G_k - k\Big\{\sigma(1-2E) + \big(\delta + [\ldots]\,\tau\big)(1-2E)\Big\}G_k \qquad \text{(A7)}$$

which has the solution $G_k = G^k$. Analogous reasoning can
be used to show that $E_e(t)$ and $G_e(s, t)$ can be used to factor
the respective functions describing the evolution of multiple
gene copies.
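
The linearization used to pass from (A6) to (A7) can be checked numerically. The sketch below compares the exact factor 2(1 - E^k) with its first-order approximation 2k(1 - E); the values of E and k are arbitrary illustrative assumptions, and the function names are hypothetical.

```python
def exact_factor(E: float, k: int) -> float:
    """Exact factor 2 * (1 - E**k) for k gene copies."""
    return 2.0 * (1.0 - E**k)


def linearized_factor(E: float, k: int) -> float:
    """First-order approximation 2 * k * (1 - E), valid when 1 - E is small."""
    return 2.0 * k * (1.0 - E)


# Illustrative (assumed) values: the approximation improves as E approaches 1.
for E in (0.9, 0.99, 0.999):
    for k in (2, 5):
        exact = exact_factor(E, k)
        approx = linearized_factor(E, k)
        rel_err = abs(approx - exact) / exact
        print(f"E = {E}, k = {k}: exact = {exact:.6f}, "
              f"approx = {approx:.6f}, rel. error = {rel_err:.2e}")
```

The relative error is of the order of (k - 1)(1 - E)/2, so the approximation is adequate only when the extinction probability is close to one.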

The probability of a gene tree

The expressions for $P_e(u, t)$ and $P(u, t)$ are derived under
the approximation that unrepresented speciation and extinction
are independent per gene, as discussed above. The full
expression for $P_e(u, t + \Delta t_i)$, including terms corresponding
to direct transfers and indirect transfers that depart S via a
preceding transfer event, which are neglected in equation 4,
is:

$$P_e(u, t+\Delta t_i) = G_e(t+\Delta t_i, t)\,P_e(u,t) + \{\delta\Delta t_i\}P_e(v,t)P_e(w,t) + \sum_{f\in S_i}[\ldots] + \{[\ldots]\}E_e(t)P(u,t). \qquad \text{(A8)}$$

Routes to cyanobacterial genomes

The dataset was constructed for near universal single-copy
genes from all 36 cyanobacterial genomes found in version 5
of the HOGENOM database (Penel et al., 2009). Amino acid
sequences were extracted for each family that had a single
copy in at least 34 of the 36 cyanobacterial genomes. For each
family, sequences were aligned using MUSCLE (Edgar, 2004)
with default parameters. The multiple alignment was
subsequently cleaned using GBLOCKS (Talavera and Castresana,
2007) with the options:

"-t=p -b1 50 -b2 50 -b5=a -t=p". (A9)

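The family selection criterion described above can be sketched as follows. This is a minimal illustration, assuming that per-genome copy counts are available as a dictionary and that genomes lacking the family or carrying several copies both count against the threshold; the function name and data layout are hypothetical and not part of the original pipeline.

```python
from typing import Dict


def is_near_universal_single_copy(copies_per_genome: Dict[str, int],
                                  min_single_copy: int = 34) -> bool:
    """Return True if the family is present in exactly one copy in at
    least `min_single_copy` genomes (34 of 36 in the text)."""
    # Genomes that are absent from the dictionary are treated as lacking the family.
    single_copy = sum(1 for count in copies_per_genome.values() if count == 1)
    return single_copy >= min_single_copy


# Hypothetical family: single copy in 34 genomes, duplicated in one, absent in one.
family = {f"genome_{i}": 1 for i in range(34)}
family["genome_34"] = 2
print(is_near_universal_single_copy(family))
```
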
Subsequently we inferred a gene topology G that maximizes
the joint likelihood

$$P_{\mathrm{exODT}}(G \mid S, \delta, \tau, \lambda, \sigma, N) \times P_{\mathrm{Felsenstein}}(\mathrm{alignment} \mid G), \qquad \text{(A10)}$$

where the first term corresponds to the likelihood of observing
the unrooted gene tree topology G according to the exODT
model developed above (equations 4 and 5), while the
second term corresponds to the classic Felsenstein likelihood
(Felsenstein, 1981) of the alignment. For the exODT model
we fixed the parameter values $N = 10^{[\ldots]}$ and $\sigma = 2N$, used the
dated phylogeny from (Szollosi et al., 2012) and estimated
global gene birth and death rates as described below. To
calculate the Felsenstein likelihood we used the Bio++ library
(Dutheil et al., 2006) with an LG+Γ4+I model. Alignments
and reconstructed gene trees are available from Dryad under
doi:10.5061/dryad.27d0g.

Gene tree inference was performed in a two-step approach:

Initial estimate:

1. using the DTL rates $\delta = 1.0 \times 10^{[\ldots]}$, $\tau = 1.0 \times 10^{[\ldots]}$,
$\lambda = 2.0 \times 10^{[\ldots]}$, for each family the joint
likelihood was calculated for all nearest neighbor
interchanges (NNIs; Felsenstein, 2004) and a
move was accepted if it improved the joint
likelihood.

2. for the set of trees obtained, global DTL
parameters were estimated that maximize the product of
the joint likelihoods of all 473 gene families.

Final estimate:

1. using the obtained DTL rates, for each family
the joint likelihood was calculated for all nearest
neighbor interchanges (NNIs; Felsenstein, 2004)
and a move was accepted if it improved
the joint likelihood.

2. for the set of trees obtained, global DTL
parameters were again estimated with the results:
$\delta = 1.010 \times 10^{-[\ldots]}$, $\tau = 4.438 \times 10^{-[\ldots]}$,
$\lambda = 1.015 \times 10^{-1}$.
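
The NNI step used in both estimates can be sketched as a simple hill climb on the joint log-likelihood, i.e. the sum of the logarithms of the two factors in (A10). The sketch below is schematic: `neighbours` and `log_likelihood` are hypothetical callables standing in for NNI enumeration and for the joint likelihood computation, and accepting the first improving move (rather than the best one) is an assumption, since the text does not specify the acceptance order.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Tree = TypeVar("Tree")


def hill_climb(start: Tree,
               neighbours: Callable[[Tree], Iterable[Tree]],
               log_likelihood: Callable[[Tree], float],
               max_rounds: int = 1000) -> Tuple[Tree, float]:
    """Accept any neighbouring tree that improves the log-likelihood and
    stop when no neighbour improves it (a local optimum)."""
    current = start
    current_ll = log_likelihood(current)
    for _ in range(max_rounds):
        improved = False
        for candidate in neighbours(current):
            candidate_ll = log_likelihood(candidate)
            if candidate_ll > current_ll:
                # Move accepted: continue the search from the improved tree.
                current, current_ll = candidate, candidate_ll
                improved = True
                break
        if not improved:
            break
    return current, current_ll


# Toy stand-in: "trees" are integers, neighbours differ by one, and the
# log-likelihood peaks at 3. Real use would plug in NNI moves and (A10).
best, best_ll = hill_climb(0,
                           lambda x: [x - 1, x + 1],
                           lambda x: -(x - 3) ** 2)
print(best, best_ll)
```

Such a search only guarantees a local optimum, which is one reason the starting topology matters.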

Before performing NNIs, starting gene tree topologies were
estimated using an amalgamation approach (David and Alm,
2011) wherein the Felsenstein likelihood was approximated
using conditional clade probabilities (Hohna and Drummond,
2012) based on a posterior sample of 10000 tree topologies
obtained using PhyloBayes with an LG+Γ4+I substitution
model.
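
The idea behind conditional clade probabilities can be illustrated with a minimal sketch. It assumes rooted binary trees written as nested tuples of leaf names, which is a simplification made here for brevity and not the representation used in the analysis; all function names are hypothetical. The probability of a tree is approximated by the product, over its internal nodes, of the frequency with which each clade was split in the observed way among the sampled trees.

```python
from collections import Counter
from typing import Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def _collect(tree: Tree, clade_counts: Counter, split_counts: Counter) -> frozenset:
    """Return the leaf set of `tree`, counting each clade and clade split."""
    if isinstance(tree, str):
        return frozenset([tree])
    left = _collect(tree[0], clade_counts, split_counts)
    right = _collect(tree[1], clade_counts, split_counts)
    clade = left | right
    clade_counts[clade] += 1
    split_counts[(clade, frozenset([left, right]))] += 1
    return clade


def ccp_tables(sample):
    """Tabulate clade and split frequencies over a sample of trees."""
    clade_counts, split_counts = Counter(), Counter()
    for tree in sample:
        _collect(tree, clade_counts, split_counts)
    return clade_counts, split_counts


def ccp(tree: Tree, clade_counts: Counter, split_counts: Counter) -> float:
    """Product of conditional clade probabilities for `tree`."""
    prob = 1.0

    def visit(node: Tree) -> frozenset:
        nonlocal prob
        if isinstance(node, str):
            return frozenset([node])
        left, right = visit(node[0]), visit(node[1])
        clade = left | right
        seen = clade_counts.get(clade, 0)
        if seen == 0:
            prob = 0.0  # clade never observed in the sample
        else:
            prob *= split_counts.get((clade, frozenset([left, right])), 0) / seen
        return clade

    visit(tree)
    return prob


# Toy posterior sample of three rooted trees on four leaves.
t1 = (("a", "b"), ("c", "d"))
t2 = ((("a", "b"), "c"), "d")
clades, splits = ccp_tables([t1, t1, t2])
print(ccp(t1, clades, splits), ccp(t2, clades, splits))
```

Because clade splits are combined independently, topologies that never appear in the sample can still receive non-zero probability when all of their splits were observed.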

The support of transfer events was measured based on a 
posterior sample of 1000 reconciliations per family. For each 
family we assessed the support of all transfer events in the 
reconciliation that was seen the largest number of times. Two 
transfers were considered equivalent if they involved the
transfer of the same gene lineage between identical branches of the
species tree.
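
The support measure described above can be sketched as follows. As a simplifying assumption, each sampled reconciliation is reduced here to the set of its transfer events, with a transfer identified by the triple (gene lineage, donor branch, recipient branch); the type aliases and function name are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple

Transfer = Tuple[str, str, str]  # (gene lineage, donor branch, recipient branch)
Reconciliation = FrozenSet[Transfer]


def transfer_support(samples: List[Reconciliation]) -> Dict[Transfer, float]:
    """For the most frequently sampled reconciliation, return the fraction
    of all sampled reconciliations that contain each of its transfers."""
    if not samples:
        raise ValueError("no sampled reconciliations")
    most_frequent, _ = Counter(samples).most_common(1)[0]
    return {
        transfer: sum(transfer in rec for rec in samples) / len(samples)
        for transfer in most_frequent
    }


# Toy sample of three reconciliations for one family.
rec_a = frozenset({("u1", "e", "f"), ("u2", "f", "h")})
rec_b = frozenset({("u1", "e", "f")})
print(transfer_support([rec_a, rec_a, rec_b]))
```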